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ABSTRACT 

ChlP-seq is increasingly used to characterize 
transcription factor binding and chromatin marks 
at a genomic scale. Various tools are now available 
to extract binding motifs from peak data sets. 
However, most approaches are only available as 
command-line programs, or via a website but with 
size restrictions. We present peak-motifs, a 
computational pipeline that discovers motifs in 
peak sequences, compares them with databases, 
exports putative binding sites for visualization in 
the UCSC genome browser and generates an exten- 
sive report suited for both naive and expert users. 
It relies on time- and memory-efficient algorithms 
enabling the treatment of several thousand peaks 
within minutes. Regarding time efficiency, peak- 
motifs outperforms all comparable tools by several 
orders of magnitude. We demonstrate its accuracy 
by analyzing data sets ranging from 4000 to 1 28000 
peaks for 12 embryonic stem cell-specific tran- 
scription factors. In all cases, the program finds 
the expected motifs and returns additional motifs 
potentially bound by cofactors. We further apply 
peak-motifs to discover tissue-specific motifs in 
peak collections for the p300 transcriptional 
co-activator. To our knowledge, peak-motifs is the 
only tool that performs a complete motif analysis 
and offers a user-friendly web interface without 
any restriction on sequence size or number of 
peaks. 



INTRODUCTION 

ChlP-seq (1,2) has recently become a method of choice to 
study the binding preferences of transcription factors, 
as well as the localization of epigenetic regulatory marks 
at a genomic scale. The first steps of the computational 
analysis (read mapping and peak calling) typically result in 
several thousands of peak regions ranging between 200 
and 10 000 bp. Motif analysis is required to extract the 
relevant information from these regions: discover 
binding motifs that capture the binding specificity of the 
pulled-down factor and their possible co-regulators; 
compare discovered motifs to databases to predict 
associated transcription factors; predict the exact pos- 
itions of the binding sites (usually much shorter than the 
peak regions); study the binding specificity of transcrip- 
tion factors in various contexts (cell types, mutant strains 
and transcription factor isoforms). 

Specialized software tools have recently been developed 
for the analysis of ChlP-seq peaks, supporting different 
combinations of motif-related tasks (Table 1). An import- 
ant bottleneck for most existing tools is that the 
underlying algorithms were originally developed to 
discover binding motifs from a small set of co-regulated 
promoters, and can hardly treat the thousands of peaks 
produced by ChlP-seq experiments. This limitation is 
typically circumvented by restricting motif discovery to a 
few hundreds peak regions and by truncating the peaks to 
a maximal width (e.g. 100 bp) to further reduce the total 
size of the sequence set (3-5). However, given the power of 
the genome-wide experimental approach, one would like 
to be able to analyze the full data set. Some alternative 
algorithms support the analysis of large-scale data sets but 
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are only available via a Unix shell interface (6-8), or as 
MATLAB functions (9), and are thus of poor usability for 
life-science researchers. 

We have developed a computational pipeline called 
'peak-motifs', motivated by the pressing need for a statis- 
tically reliable, time-efficient and user-friendly framework 
to analyze full data sets of ChlP-seq peaks or similar data 
(ChlP-PET, ChlP-on-chip, CLIP-seq). This comprehen- 
sive pipeline takes as input a set of peak sequences, 
discovers exceptional motifs, compares them with motif 
databases, predicts binding site positions and returns a 
structured HTML report with direct links to visualization 
in the UCSC genome browser (Figure 1). This tool can 
also be used for differential analyses, where two datasets 
are given as input (e.g. test versus control, or peaks from 
two experimental conditions), to discover motifs specific 
to one of the datasets. 

We first show that this motif discovery approach is 
significantly faster than other available alternatives, 
thereby allowing processing of comprehensive ChlP-seq 
data sets, even from the web server. We then demonstrate 
the biological relevance of the motifs discovered by our 
pipeline with two study cases, highlighting the benefit of 
analyzing complete datasets and using complementary 
approaches for motif discovery. 

MATERIALS AND METHODS 

The motif discovery step relies on a combination of 
tried-and-tested algorithms integrated in the software 
suite regulatory sequence analysis tools (RSAT, http:// 
rsat.ulb.ac.be/rsat/) (10-12), which use complementary 
criteria to detect exceptional words (oligonucleotides and 
spaced motifs): global over-representation of oligonucleo- 
tides (oligo- analysis) or spaced pairs (dyad-analysis), 
heterogeneous positional distribution (position-analysis) 
and local over-representation (local-word-analysis) 
(12-15). 

The motif comparison step is performed by compare- 
matrices (12), which supports a wide range of scoring 
metrics and displays the results as multiple alignments of 
logos, enabling to grasp the similarities between a dis- 
covered motif and several known motifs. This feature is 
particularly valuable to reveal adjacent fragments of the 
discovered motif showing similarities with two distinct 
known motifs, suggesting a bipartite motif for two 
factors (see the SOCT motif in Figure 4 and below). 

As the individual components of the workflow have 
been described previously (12), we briefly explain here 
the choice of parameters for the different steps of 
peak-motifs analyses. The full list of commands and 
parameters are automatically reported at the end of each 
peak-motifs report. The parameters used for the case 
studies are available in the peak-motifs reports on the 
supporting website (http://rsat.bigre.ulb.ac.be/~rsat/ 
supp_material_peak-motifs/). 

Motif discovery 

Word-based analysis is performed with hexanucleotides 
(k = 6) and heptanucleotides (k = 7). The significance 



tests underlying pattern detection ensure a control of the 
rate of false positives, with suitable multi-testing correc- 
tions. The motif discovery algorithms support higher 
order background models, which are of particular import- 
ance for modeling genomic sequences of vertebrates. 
For oligo- analysis, expected word frequencies were 
estimated with a Markov model of order m = k — 2, 
trained in the peak sequences. The website also allows to 
select lower order Markov models, which are less stringent 
but achieve a higher sensitivity with small data sets. For 
differential analysis, the expected frequency of each £-mer 
is estimated by taking the observed frequency of the same 
k-mer in the control set. Significant words are assembled 
using 'pattern-assembly' and converted to position-specific 
scoring matrices with matrix-from-p at terns. 

Motif comparison 

Discovered motifs are compared (using compare-matrices) 
to one or several databases of known transcription factor 
binding motifs. The website directly supports comparisons 
with JASPAR (16), UNIPROBE (17), REGULONDB 
(18) and Drosophila-specific collections (19), thus 
providing a vast choice of known motifs, for a wide 
range of organisms. Personal or license-protected motif 
collections can also be uploaded. Several metrics are 
computed to measure the similarity between each matrix 
pair (Pearson correlation, width normalized correlation, 
logo dot product, correlation of information content, 
normalized Sandelin-Wasserman, sum of squared dis- 
tances and normalized Euclidian similarity). As these 
metrics span over very different ranges, we convert them 
to ranks and compute a mean rank in order to obtain 
a robust comparison metrics. 

Matrix scanning 

Peak sequences are scanned to predict binding sites with 
the program matrix-scan, using as background model 
a Markov chain of order 1 trained on the peak sequences 
themselves. Noteworthy, a Markov order m > 1 is required 
to account for the CpG avoidance observed in vertebrate 
genomes, and for other types of context-dependent residue 
probabilities. Predicted binding sites are mapped onto the 
genome (convert-features) and exported as BED files to 
be automatically loaded as custom tracks on the UCSC 
genome browser. 

RESULTS 

Peak-motifs processes full-sized ChlP-seq data sets in 
a few minutes 

We assessed the time efficiency of peak-motifs by 
analyzing data sets of increasing sizes (from 100 to 
1000 000 peaks of 100 bp each), with total sequence 
sizes ranging from lOkb to 100 Mb The computing 
time of the motif discovery algorithms integrated in 
peak-motifs increases linearly with sequence size and 
outperforms all the other existing motif discovery tools 
used in this comparison (Figure 2, Supplementary File 
SI). Data sets of several tens of megabytes are 
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Figure 1. Schematic flow chart of the peak-motifs pipeline. For sake of clarity, only the main analysis steps are depicted. The pipeline takes as input 
a set of peak sequences, and runs several de novo motif discovery algorithms based on different detection criteria: over-representation, differential 
representation (test versus control), global position bias or local over-representation along the centered peaks. Transcription factors are predicted by 
matching discovered motifs against several public motif databases and/or against user-uploaded motif collections. Peak sequences are scanned with 
the discovered motifs to predict precise binding positions. These positions are then automatically exported as an annotation track for UCSC genome 
browser, thus enabling a flexible visualization in their genomic context. 



processed in a few minutes on a personal computer (the 
most efficient tool, oligo- analysis, treats 100 Mb in 
3min). This linear time response enables peak-motifs 
to scale up efficiently with sequence size, and allows 
us to provide an easy access via a web interface, 



without any data size restriction. This moreover gives 
us the possibility to run four distinct algorithms in 
order to detect motifs of various types (oligonucleotides, 
spaced pairs) based on complementary criteria 
(over-representation, positional heterogeneity). 
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Figure 2. Time efficiency of motif discovery algorithms integrated in 
peak-motifs (plain lines) compared to alternative algorithms (dotted 
lines). The abscissa indicates sequence sizes, the ordinate processing 
times. The programs oligo-, dyad-, position-analysis and DREME 
show a linear time complexity (the power is ~1), ChlPMunk has 
a quasi-linear complexity (power 1.27) and MEME a more than 
quadratic complexity (power 2.21). See Supplementary File SI for the 
detailed analysis. 



Analysis of the ChlP-seq peak sets for 12 DNA-binding 
transcription factors involved in mouse ES cell 
pluripotency and self-renewal 

To evaluate the accuracy of the predicted motifs, we 
analyzed the ChlP-seq peak sets for 12 DNA-binding 
transcription factors involved in mouse embryonic stem 
cell pluripotency and self-renewal (20). The read sequences 
were downloaded from the Gene Expression Omnibus 
website and mapped with Bowtie (21) on the mouse 
mm9 assembly. Peak regions were extracted from reads 
using MACS (22) with a false discovery rate threshold 
of 0.2, and processed with PeakSplitter (23) to obtain 
actual peaks. For the Smadl data set, MACS did not 
return a single peak with the selected parameters. 
We, therefore, used the peaks from the initial data set 
GSM288348, which contains 1084 ChlP-seq peaks for 
the Smadl factor. The other data sets comprise between 
4249 peaks for Stat3 (totaling 1.4 Mb) and 1 28 469 peaks 
for Esrrb (36.6 Mb). 

For each of the 12 tested factors, peak-motifs 
discovered the correct motif (Figure 3). The relevant 
motifs were generally detected independently by several 
of the four algorithms, indicating that they are not only 
over-represented (pligo- analysis, dyad- analysis) but also 
positionally biased around peak centers (local-word- 
analysis, position- analysis). For several peak sets, recent 
studies (5,24) using novel motif-finding programs 
returned more accurate motifs than the original study, 
which was restricted to the 500 top-scoring peaks. Our 
comprehensive analysis also returned more accurate 



motifs than the original study, and performed as well or 
better than other recent motif-finding programs, as 
detailed below. 

In the Sox2 and Oct4 peak sets, peak-motifs found not 
only the composite 'SOCT motif bound by the Sox2/Oct4 
complex (reported by Chen and co-workers), but also 
the distinct motifs recognized by Sox2 (CTCTTTGTT) 
and Oct4 (ATGyAAAt), respectively (Figure 4, top). 
Interestingly, in the Oct4 data set, unknown motifs were 
returned with a high significance, (i.e. motifs with no 
significant similarity with the consensus encompassed by 
the common databases). Such motifs may reveal alterna- 
tive consensus, as in the case of the motif crTATGCGCA 
TAyg, which actually corresponds to an alternative 
Oct4 motif, also detected in other recent studies (5,24). 

As discussed by Chen and co-workers, Nanog and 
Smadl frequently bind the same regions as Sox2/Oct4, 
which raises a particular difficulty for motif discovery. 
Indeed, their analysis of the Nanog peak set returned 
a Sox2-like motif instead of the Nanog binding motif. 
Subsequently, this Sox2-like motif was erroneously 
annotated as Nanog binding in the TRANSFAC 
database (matrix V$NANOG_02), although its consensus 
(CYWTTGTTNT) clearly differs from the previously 
annotated Nanog consensus (GGGNCCATTKCC, 
TRANSFAC matrix V$NANOG_01). The prevalent 
motif discovered by peak-motifs in Nanog peaks corres- 
ponds to the SOCT binding motif, while the canonical 
Nanog motif is not found. However, peak-motifs reports 
a motif (sCGCmaTCAbg) that is not similar to any 
motif found in the databases (Figure 4, middle). 
A similar motif with a ccAT(C/T)A core was also 
reported by Bailey (5), and actually corresponds to an 
experimentally validated alternative Nanog motif (25). 

For the Smadl factor, the peak size distribution of the 
original data set seems to be biased toward very small 
peaks (smallest peak is lbp, mean size id 30 bp); neverthe- 
less, peak-motifs was able to discover a motif agAAACA 
AAGCmar that matches the canonical Smadl motif 
(V$SMAD1_01 ad A A AC AA AGcm) . In addition, 
several other discovered motifs match a Sox-like motif 
wGAACAATAga, confirming the frequent co-binding 
of Smadl and Sox. 

In the E2fl peak set, peak-motifs discovered sev- 
eral motifs matching the generic E2F consensus 
(GGCGsg, matrix V$E2F_Q2) but distinct from the 
E2fl -specific consensus (TTTsGCGG, in Transfac 
matrix V$E2F1_Q4; TTTsGCGC in JASPAR matrix 
MA0024.1) (Figure 4, bottom). Whereas no E2fl motif 
was detected in the original study by Chen and co-workers 
(20), an E2f-like motif similar to ours was reported in 
ref. (5). 

In summary, our analysis of the 1 2 peak sets from Chen 
and co-workers significantly improved motifs as compared 
to the original study, highlighting the value of applying 
motif discovery to full-size data sets. Remarkably, in 
addition to the motifs corresponding to the transcription 
factors targeted by the experiments, peak-motifs also 
returned several motifs corresponding to transcription 
factors presumably involved in the same regulatory 
pathways. 
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Figure 4. Logos of the motifs discovered by peak-motifs for the factors 
Oct4, Sox2, Nanog and E2fl adapted from the ChlP-seq data set by 
Chen et al. (20). 



Analysis of the ChlP-seq peak sets for p300 in four 
different mouse embryonic tissues 

Beyond the analysis of motif-specific DNA-binding tran- 
scription factors, the ChlP-seq approach can be used to 
characterize binding profiles of epigenetic regulators, 
chromatin marks and generic cofactors. In contrast with 
the transcription factors analyzed above, such cofactors 
do not recognize specific DNA motifs, but interact with 
various specific DNA-binding transcription factors and 
facilitate the activation of their target genes by modifying 
DNA structure. Genome- wide location analyzes of the 
generic cofactor p300 have been performed to reveal 
regions transcriptionally active in different tissues during 
embryonic development (26,27). Since the DNA regions 
identified by this approach likely contain binding sites for 
the transcription factors specifically active in the analyzed 
tissues and developmental stages, we wondered if 
peak-motifs would be able to detect the corresponding 
motifs. In this respect, we used peak-motifs to detect 
motifs from the ChlP-seq peaks of the generic 
enhancer-associated p300 cofactor. In the two aforemen- 
tioned studies, binding profiles of this cofactor were 
characterized in several embryonic mouse tissues (heart, 
midbrain, forebrain and limb) and some binding regions 
were validated as tissue-specific enhancers. However, the 
transcription factors bound to those enhancers remain 
unknown. 



We retrieved the peak locations for all four tissues. 
By running peak-motifs in the p300 peak sets in each of 
these four tissues, we were able to identify motifs poten- 
tially bound by tissue-specific regulators, as well as some 
motifs common to all four data sets, probably correspond- 
ing to ubiquitous activators (Supplementary File S2). 
Peak-motifs compared these discovered motifs to motifs 
of known factors stored in databases, including Transfac, 
JASPAR and UniProbe. Tissue-specific motifs include 
a motif found in the limb data set alone, which matches 
the consensus of Hox9, known to be involved in limb 
development (28). We also identify a GATA motif 
specific to the heart data set, which presumably points 
to a key factor of the cardiac gene regulatory network. 
As a validation of these predictions, we verified that the 
predicted transcription factors are indeed expressed in 
the corresponding tissues, using expression data from 
the MGI database (29) (Supplementary File S2). 

For further validation, we analyzed data generated by 
ChlP-seq experiments targeting various heart-specific 
transcription factors (Mef2, SRF, GATA4, Nkx2.5) in 
the mouse HL1 cardiomyocyte cell line. Predicted motifs 
for these data sets strengthen our findings (Figure 5): the 
predicted GATA motif from the p300 heart data set 
clusters with similar motifs obtained from the GATA4 
data set. Similarly, several motifs obtained in HL1 data 
sets cluster with the set of motifs from the p300 data 
sets matching the Mef2 consensus, giving insight into 
the highly combinatorial nature of cardiogenesis. We 
also found two 'ubiquitous' motifs significantly over- 
represented in all four data sets. The first is a C-rich 
motif, which matches the binding motif of Spl, consistent 
with the fact that Spl functionally interacts with the 
acetylase domain of p300 (30,31). The second motif 
matches the Mef2 consensus (ATTTTTA). Interestingly, 
Mef2 is known to be involved not only in muscle forma- 
tion, explaining its presence in the heart and limb data set, 
but also in CNS development (in particular neuron 
differentiation). 

The relevance of the discovered motifs opens the 
exciting prospect of predicting which transcription 
factors and enhancers are active in a given tissue and/or 
at a given developmental stage, by discovering specific TF 
motifs in the peaks pulled down by generic cofactors such 
as p300. 

Peak-motifs is accessible through a user-friendly web 
interface 

The simplest way to use peak-motifs is via its user-friendly 
web interface, where all parameters (background 
models, word lengths, etc.) are pre-selected according to 
the optimal conditions found from our study cases. The 
only required input is the set of peak sequences. A second 
set of peak sequences can also be provided to serve as 
background for differential analyses (treatment versus 
control). Although peak-motifs is designed to process 
full data sets, the interface offers the possibility to easily 
reduce the analysis to a subset of top sequences, or yet to 
clip peaks at a maximal size from their centers, thereby 
reducing the need for data manipulation on the user 
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side. The web page is documented with a manual 
providing detailed information about each option. 
A 'demo' button fills up the form with a typical test set. 
A tutorial further guides new users through choices of 
parameters and explains how to interpret the results. 
In addition to its website access, peak-motifs can be 
used as a stand-alone application (Unix shell), as well 
as SOAP/WSDL web services (thereby enabling bioinfor- 
maticians to automate its use, without installing it on 
their machine). 

A particular effort has been made to generate a clear 
and easily interpretable output for less-advanced users, 
while providing links to the raw results for the expert 
users. All result files are presented in standard formats 
and are downloadable as an archive along with the 
summary web page, to allow further analysis with 
third-party software. To our knowledge, peak-motifs is 
the only ChlP-seq pipeline offering direct visualization 
of the predicted binding sites as custom tracks in 
the UCSC genome browser. This feature is of prime 
importance to interpret the results in light of the 
genomic annotation, in order to plan experiments for 
further validation of the results. 



DISCUSSION 

Peak-motifs is a comprehensive pipeline to efficiently 
discover motifs and identify putative transcription 
factors in ChlP-seq and related data sets. We 
demonstrated its biological validity by recovering the 
correct motifs from 12 ChlP-seq sets corresponding to 
known transcription factors (20). We also performed an 



original analysis of the binding profiles of the generic 
cofactor p300 (26), which led us to predict specific 
motifs and transcription factors that are active in 
specific tissues at specific developmental stages. Our 
benchmarks showed that for large data sets peak-motifs 
outperforms its most serious competitors by a factor of at 
least 100, allowing us to analyze full data sets in a matter 
of minutes. This time efficiency enables an interactive web 
access for comprehensive data sets, thereby constituting 
a convenient tool for ChlP-seq data analyses even for 
naive users. This tool will be of broad interest to 
the increasing community of experimentalists and 
bioinformaticians who are confronted to the challenging 
issue of extracting interpretable information from the 
massive amounts of data resulting from next generation 
sequencing. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Files 1 and 2. 
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